Summary

Note: In all cases, the number of notifications is per month, and the percentages are out of the total number of users who received any notifications, or the total number of notifications received.

The distribution of notifications seems largely to follow a long-tailed power law distribution. Almost all notified users receive very few (about 95% get from 1 to 4), while a very small number of highly-active users account for a large number of notifications (with about 0.1% receiving 10-15% of the total notifications).

However, the distributions did vary somewhat between the wikis, falling into three main groups:

  • English Wikipedia and Japanese Wikipedia, which I expect represent "normal" large wikis.
  • Commons, where the number of highly-notified users is particularly small (99.0% of users receive 1 to 4 notifications, and 0.2% receive 25 or more).
  • French Wikipedia and Chinese Wikipedia, where Flow is used heavily. Both these wikis have a somewhat larger proportion of users with many notifications (about 0.9% of their users receiving 30 or more per month, compared with about 0.4% for English and Japanese).

Users generally don't build up large numbers of unread notifications; for example, out of the 1203 enwiki users who had 30 or more notifications total, only 3.7% had 30 or more unread notifications.

I suggest the following conclusions based on this data:

  • It's not worth investing significant resources into notification management tools beyond better bundling until Flow or other notification-heavy products are more widely used.
  • It may be worth investigating the small group of users who receive a very high number of notifications. For example, how many are bots? How many of the human users feel overwhelmed by their notifications?

Data collection

Choice of wikis

I chose to look at notification use on five different wikis:

  • Commons (commonswiki)
  • English Wikipedia (enwiki)
  • Japanese Wikipedia (jawiki)
  • French Wikipedia (frwiki)
  • Chinese Wikipedia (zhwiki)

Commons, enwiki, and jawiki represent "normal" large wikis; on the other hand, frwiki and zhwiki are relatively heavy users of Flow (frwiki at its central discussion board and zhwiki on user talk pages). Flow generates a large number of notifications but is not in use at many wikis; in making heavy use of notifications, it represents the likely future direction of MediaWiki software.

SQL

The data was generated from the Echo extension's database tables using the following queries:

Total notifications

SELECT 
    DATABASE() as "wiki",
    `notifications`,
     COUNT(*) as "users"
FROM
(
    SELECT
        COUNT(*) as "notifications"
    FROM echo_notification
    WHERE
        notification_timestamp > "20160219" AND
        notification_timestamp < "20160321" AND
        notification_bundle_base = 1
    GROUP BY notification_user
) notifications_by_user
GROUP BY `notifications`;
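To make the two-step aggregation concrete, here is a minimal Python sketch of the same logic on made-up rows. The row values are hypothetical; only the roles of the columns (the user and the bundle-base flag) come from the query above.

```python
from collections import Counter

# Hypothetical (notification_user, notification_bundle_base) rows
# standing in for echo_notification; the values are made up.
rows = [(1, 1), (1, 1), (1, 0), (2, 1), (3, 1), (3, 1), (4, 1)]

# Inner query: notifications per user, counting bundle bases only.
per_user = Counter(user for user, bundle_base in rows if bundle_base == 1)

# Outer query: number of users at each notification count.
distribution = Counter(per_user.values())

for notifications, users in sorted(distribution.items()):
    print(notifications, users)
```

The output has one row per distinct notification count, which is the shape the analysis cells further down expect.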

Unread notifications

SELECT
    DATABASE() as "wiki",
    `unread notifications`,
     COUNT(*) as "users"
FROM
(
    SELECT
        SUM( IF( notification_read_timestamp IS NULL, 1, 0) ) as "unread notifications"
    FROM echo_notification
    WHERE
        notification_timestamp > "20160219" AND
        notification_timestamp < "20160321" AND
        notification_bundle_base = 1
    GROUP BY notification_user
) notifications_by_user
GROUP BY `unread notifications`;
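One consequence of using SUM( IF( ... ) ) rather than filtering on unread rows is that users whose notifications are all read still appear, with a count of 0. A minimal Python sketch on made-up rows (None stands in for SQL NULL; the data is hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical (notification_user, notification_read_timestamp) rows.
rows = [
    (1, "20160301120000"), (1, None), (1, None),
    (2, "20160305090000"),
    (3, None),
]

unread_per_user = defaultdict(int)
for user, read_timestamp in rows:
    # Mirrors SUM( IF( notification_read_timestamp IS NULL, 1, 0) ).
    unread_per_user[user] += 1 if read_timestamp is None else 0

# Number of users at each unread count, including a row for 0 unread.
unread_distribution = Counter(unread_per_user.values())
print(sorted(unread_distribution.items()))
```

Because of that zero row, the unread percentages use the same denominator as the total-notification percentages: every user who received at least one notification in the window.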

I gathered the results from the 5 different wikis using multiquery with the following command:

$ multiquery notifications_per_user.sql --dbnames=notifications_dbs.tsv --host=x1-analytics-slave.eqiad.wmnet --defaults-file=~/.my.cnf > ~/notifications_per_user.tsv

I ran the unread query on 28 March, so the unread counts reflect notifications from the month which were still unread about 7 days after it ended.
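The window length and the lag can be checked with a little date arithmetic. This sketch assumes that timestamps are stored as 14-digit strings (so the lower bound effectively includes all of 19 February) and that "28 March" means 28 March 2016; neither assumption is stated explicitly above.

```python
from datetime import date

window_start = date(2016, 2, 19)      # first day matched by the lower bound (assumed)
window_end = date(2016, 3, 21)        # exclusive upper bound from the queries
unread_query_run = date(2016, 3, 28)  # year assumed to be 2016

window_days = (window_end - window_start).days
lag_days = (unread_query_run - window_end).days
print(window_days, lag_days)
```

Under those assumptions, the window spans 31 days and the unread query ran 7 days after it closed.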

Notification bundling

The queries include notification_bundle_base = 1 to exclude "bundled" notifications, which (1) don't behave as standalone notifications from the user's point of view and (2) are never directly marked as read in the database (instead, their read status is computed).

There are downsides to this approach. The bundled notifications are real notifications (although they're generally less interesting to the user than standalone notifications), and omitting them undercounts notification activity. However, the main purpose of this study is to understand whether users are currently overloaded with notifications, so we shouldn't ignore bundling's effect in reducing that load. In addition, the number of unread notifications is an important measure in this study, and the logic used to determine whether a bundled notification has been read is too complex to reimplement here.

Accounting for bundling has a dramatic effect on unread notifications; for example, a previous version of this study found that 19% of enwiki users with at least 25 notifications had at least 25 unread notifications. After excluding bundled notifications, that figure went down to 2%.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
notifs = pd.read_table("./notifications_per_user.tsv")
unreads = pd.read_table("./unread_notifications_per_user.tsv")
wikis = set(notifs["wiki"])

notifs.tail()


Out[2]:
     wiki    notifications  users
590  zhwiki            187      1
591  zhwiki            190      1
592  zhwiki            325      1
593  zhwiki            454      1
594  zhwiki            525      1

In [3]:
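def filter_by_wiki( df, wiki ):
    # Keep only this wiki's rows and drop the "wiki" column
    return df[ df["wiki"] == wiki ].iloc[:, 1:3]

def plot_by_wiki( df, wiki, range = (5, 104), bins = 20, ax = None ):
    # Fall back to the current axes; the plt module has no set_title()
    if ax is None:
        ax = plt.gca()
    dist = filter_by_wiki( df, wiki )
    # Rows are (count, users), so weight each count by its number of users
    ax.hist( dist.iloc[:, 0], bins = bins, range = range, weights = dist.iloc[:, 1])
    ax.set_title(wiki)
    ax.set_xlabel( "Number of notifications" )
    ax.set_ylabel( "Users" )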
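With range = (5, 104) and bins = 20, each histogram bin ends up holding five consecutive integer counts (5-9, 10-14, and so on up to 100-104), and each row adds its user count to its bin. Here is a minimal pure-Python sketch of that weighted binning on made-up rows; it illustrates the idea and is not what matplotlib does internally.

```python
# Hypothetical (notifications, users) rows in the same shape as the query output.
rows = [(1, 900), (3, 50), (5, 20), (9, 10), (10, 7), (57, 2), (104, 1), (300, 1)]

low, high, width = 5, 104, 5           # bins cover 5-9, 10-14, ..., 100-104
n_bins = (high - low + 1) // width
bins = [0] * n_bins

for notifications, users in rows:
    # Rows outside the plotted range are dropped, as with hist(range=...).
    if low <= notifications <= high:
        # Each row contributes its user count (the weight) to its bin.
        bins[(notifications - low) // width] += users

print(n_bins, bins)
```

Note that the last bin covers 100 to 104, so the default range keeps a few users at or just above 100 notifications.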

Total notifications

This data covers all the users who received at least one notification during the month, whether they actually visited the site during the month or not, so we'd expect that the numbers are dominated by a large bulk of users with very few notifications, and that there's a long tail of very few users with a very large number of notifications.

But let's characterize that a bit. At each wiki, how many and what percent of users with any notifications got fewer than 5?


In [4]:
def beyond_threshold(df, wikis, threshold, direction):
    # For each wiki, summarize the users strictly under or over the threshold
    if direction not in ("under", "over"):
        raise ValueError('direction must be "under" or "over"')

    columns = [
        "wiki",
        "users",
        "% of users",
        "% of notifications"
    ]

    results = []

    for wiki in wikis:
        by_wiki = filter_by_wiki(df, wiki)
        counts = by_wiki.iloc[:, 0]
        users = by_wiki.iloc[:, 1]

        total_users = users.sum()
        # Each row stands for `users` people with `counts` notifications each
        total_notifs = (counts * users).sum()

        if direction == "under":
            selected = counts < threshold
        else:
            selected = counts > threshold

        users_beyond_threshold = users[selected].sum()
        notifs_beyond_threshold = (counts[selected] * users[selected]).sum()

        user_proportion = users_beyond_threshold / total_users
        notifs_proportion = notifs_beyond_threshold / total_notifs

        results.append([
            wiki,
            users_beyond_threshold,
            round(user_proportion * 100, 1),
            round(notifs_proportion * 100, 1)
        ])

    return pd.DataFrame(results, columns=columns)

beyond_threshold( notifs, wikis, 5, "under")


Out[4]:
   wiki          users  % of users  % of notifications
0  zhwiki         9459        95.9                53.2
1  enwiki       240282        96.2                63.9
2  commonswiki   61302        99.0                86.7
3  jawiki         8754        94.6                62.1
4  frwiki        30477        95.0                55.6

5 or more?


In [5]:
beyond_threshold(notifs, wikis, 4, "over")


Out[5]:
   wiki          users  % of users  % of notifications
0  zhwiki          405         4.1                46.8
1  enwiki         9524         3.8                36.1
2  commonswiki     635         1.0                13.3
3  jawiki          501         5.4                37.9
4  frwiki         1607         5.0                44.4
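As a quick sanity check, "fewer than 5" and "5 or more" should partition each wiki's users, so the two user counts should add up to the total and reproduce both percentage columns. A small sketch using the user counts copied from Out[4] and Out[5]:

```python
# (users with fewer than 5, users with 5 or more), copied from Out[4] and Out[5].
counts = {
    "zhwiki": (9459, 405),
    "enwiki": (240282, 9524),
    "commonswiki": (61302, 635),
    "jawiki": (8754, 501),
    "frwiki": (30477, 1607),
}

shares = {}
for wiki, (under, over) in counts.items():
    total = under + over
    # Percentage of notified users on each side of the threshold.
    shares[wiki] = (round(under / total * 100, 1), round(over / total * 100, 1))
    print(wiki, total, shares[wiki])
```

The recomputed percentages match the "% of users" columns in both tables, which suggests the "under" and "over" branches of beyond_threshold are consistent with each other.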

And what percentage of users got 25 notifications or more, becoming more or less "daily notified"?


In [6]:
beyond_threshold(notifs, wikis, 24, "over")


Out[6]:
   wiki          users  % of users  % of notifications
0  zhwiki           96         1.0                33.1
1  enwiki         1474         0.6                21.3
2  commonswiki     102         0.2                 8.2
3  jawiki           61         0.7                18.4
4  frwiki          382         1.2                32.0

That's lower than I expected at English Wikipedia. It only had about 1,200 users with at least 30 notifications per month, compared to 3,500 highly active users (100+ edits) per month. However, both Flow wikis have higher percentages than the non-Flow wikis.

Now, let's look at the actual distributions. To make it easier to comprehend, I'll cut off the 90%+ of users with fewer than 5 notifications. I'll also cut off the users with 100 or more. How many is that?


In [7]:
beyond_threshold(notifs, wikis, 99, "over")


Out[7]:
   wiki          users  % of users  % of notifications
0  zhwiki           17         0.2                15.5
1  enwiki          248         0.1                 9.5
2  commonswiki      15         0.0                 3.6
3  jawiki            7         0.1                 8.2
4  frwiki           59         0.2                14.7

Graphs


In [8]:
fig, axarr = plt.subplots( 5, 1, figsize=(12,30) )
fig.suptitle("Total notifications per user", fontsize=24)
fig.subplots_adjust(top=0.95)
for i, wiki in enumerate(wikis):
    plot_by_wiki(notifs, wiki, ax = axarr[i])


So, as expected, all the wikis have a pretty regular power-law distribution of notifications.
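I judged the shape by eye, but one rough way to check a power law is to fit a straight line to log(users) against log(notifications): a power law users = C * n^(-k) shows up as a line with slope -k. The sketch below uses synthetic data with a known exponent, not the wiki data, and an ordinary least-squares fit, which is only a coarse check on real, noisy counts.

```python
import math

# Synthetic (notifications, users) pairs following users = 100000 * n**-2.
rows = [(n, 100000 * n ** -2) for n in range(1, 51)]

xs = [math.log(n) for n, _ in rows]
ys = [math.log(users) for _, users in rows]

# Ordinary least-squares slope of log(users) against log(notifications).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
print(round(slope, 2))
```

On the synthetic data the fitted slope recovers the exponent of -2; running the same fit on each wiki's rows would give a comparable number per wiki.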

Unread notifications

First, the counts and percentages for various levels of unread notifications.

Under 5


In [9]:
beyond_threshold(unreads, wikis, 5, "under")


Out[9]:
   wiki          users  % of users  % of notifications
0  zhwiki         9815        99.5                84.0
1  enwiki       249290        99.8                95.6
2  commonswiki   61908       100.0                98.6
3  jawiki         9202        99.4                93.6
4  frwiki        32003        99.7                95.7

5 or more


In [10]:
beyond_threshold(unreads, wikis, 4, "over")


Out[10]:
   wiki          users  % of users  % of notifications
0  zhwiki           49         0.5                16.0
1  enwiki          513         0.2                 4.4
2  commonswiki      26         0.0                 1.4
3  jawiki           53         0.6                 6.4
4  frwiki           81         0.3                 4.3

25 or more


In [11]:
beyond_threshold(unreads, wikis, 24, "over")


Out[11]:
   wiki          users  % of users  % of notifications
0  zhwiki           14         0.1                10.8
1  enwiki           53         0.0                 2.3
2  commonswiki       5         0.0                 1.1
3  jawiki            2         0.0                 0.8
4  frwiki           13         0.0                 2.3

100 or more


In [12]:
beyond_threshold(unreads, wikis, 99, "over")


Out[12]:
   wiki          users  % of users  % of notifications
0  zhwiki            1         0.0                 2.3
1  enwiki           11         0.0                 1.2
2  commonswiki       2         0.0                 0.8
3  jawiki            0         0.0                 0.0
4  frwiki            0         0.0                 0.0

Histograms


In [13]:
fig, axarr = plt.subplots( 5, 1, figsize=(12,30) )
fig.suptitle("Unread notifications per user", fontsize=24)
fig.subplots_adjust(top=0.95)
for i, wiki in enumerate(wikis):
    plot_by_wiki(unreads, wiki, ax = axarr[i])